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Abstract 

Approximate matrix factorization techniques with both nonnegativity and orthogonality constraints, 
referred to as orthogonal nonnegative matrix factorization (ONMF), have been recently introduced and shown 
to work remarkably well for clustering tasks such as document classification. In this paper, we introduce two 
new methods to solve ONMF. First, we show mathematical equivalence between ONMF and a weighted variant 
of spherical fc-means, from which we derive our first method, a simple EM-like algorithm. Our second method 
is based on an augmented Lagrangian approach. Standard ONMF algorithms typically enforce nonnegativity 
for their iterates while trying to achieve orthogonality at the limit (e.g., using a proper penalization term or 
a suitably chosen search direction). Our method works the opposite way: orthogonality is strictly imposed 
at each step while nonnegativity is asymptotically obtained, using a quadratic penalty. Finally, we show that 
the two proposed approaches compare favorably with standard ONMF algorithms on both text and image 
datasets. 

1 Introduction 

We are interested in solving nonnegative matrix factorization (NMF) problems with additional orthogonality 
constraints. Given an m-by-71 nonnegative matrix M and a factorization rank k (with k < min{m,n}), NMF can 
be formulated as follows 

min \\M - UV\\% s.t. U > and V > 0, 

i.e., find an m-by-fc nonnegative matrix U and an k-by-n nonnegative matrix V such that M ~ UV. NMF has 
become a very popular dimensionality reduction technique, and has been used successfully in many applications, 
see, e.g., [21 IS] and the references therein. Adding the orthogonality constraint VV T — Ik leads to another problem 
called orthogonal nonnegative matrix factorization (ONMF) which has tight connections with data clustering (see 
Section [2]). In particular, empirical evidence suggests that this additional orthogonality constraint can improve 
clustering performance compared to standard NMF or k- means [5[ [TJ5]. Current approaches to solve ONMF 
problems are typically based on suitable modifications of the algorithms developed for the original NMF problem 
[8], [19] . They enforce nonnegativity of the iterates at each step, and strive to attain orthogonality at the limit 
(but never attain exactly orthogonal solutions). In fact, dealing with matrices that are both orthogonal and 
nonnegative is difficult because the combination of these two properties imposes a sparsity structure that confers 
a combinatorial aspect to the problem (directly related to clustering indicator matrices, see Section [5]), which is 
not easily handled by standard continuous optimization schemes. 

The paper is organized as follows. In Section [2j we analyze the relationship between ONMF and clustering 
problems and show that it is closely related to spherical fc-means. Based on this analysis, we develop an EM- 
like algorithm which features a rank-one NMF problem at its core. Section [3] introduces another algorithm to 
perform ONMF using an augmented Lagrangian and a projected gradient scheme, which enforce orthogonality at 
each step while obtaining nonnegativity at the limit. Finally, in Section 21 we experimentally show that our two 
new approaches perform competitively with standard ONMF algorithms on text datasets and on different image 
decomposition problems. 
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2 Equivalence of ONMF with a Weighted Variant of Spherical fc-means 

In this section, we briefly recall how NMF with an additional constraint is equivalent to a fundamental clustering 
technique: Euclidean fc-means [HE]. We then show that relaxing this constraint leads to ONMF, which is therefore 
not exactly equivalent to fc-means but rather to another problem closely related to spherical fc-means [2]. Based 
on this analysis, we propose a new EM-like algorithm to solve ONMF problems. 

2.1 Equivalence with Euclidean fc-means Let M = (mi, . . . , m n ) <G R™ xn be a nonnegative data matrix 
whose columns represent a set of n points {mj}™ =1 G M.™. Solving the clustering problem means finding a set 
{TTi}t =1 of fc disjoint clusters: 

TTi C {1,2, ...,ra} Vi, Ui<i< k TTi = {1,2, ...,n}, 
and 7r,j l~l TTj = Vi ^ j, 

such that each cluster TTj contains objects as similar as possible to each other according to some quantitative 
criterion. When choosing the Euclidean distance, we obtain the fc-means problem, which can be formulated as 
follows [6]: 
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where c, = — ^ — -. are the cluster centroids. 

\n\ 

Equivalently, we can define a binary cluster indicator matrix B g {0, lj kxn as follows: 

B = {bij}kxn where 6y = 1 <*=> j £ 7r*. 

Disjointness of clusters TTi means that rows of B are orthogonal, i.e., BB T is diagonal. Therefore we can normalize 
them to obtain an orthogonal matrix V — {vij}kxn — {BB T )~^ B (a weighted cluster indicator matrix) which 
satisfies the following condition: 

There exists a set of clusters {iTi}i =1 such that 

'-7f=r> if i e7r ^ 
^0, otherwise. 

It has been shown in [7] that the NMF problem with matrix V satisfying condition (fcTj) : 
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(2.1) min \\M-UV\\l s.t. V satisfies ETJ 

£/>0,V>0 



is equivalent to fc-means. In fact, since V in problem ()2.1|) is a normalized indicator matrix which satisfies 
Vij = | vr^ | ^ <^=J> j 6 7Tj, we have 
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which implies that, at optimality, each column itj of U must correspond (up to a multiplicative factor) to a cluster 
centroid with v>i — ^ [ /\^u\ci — ^ je,r ' 3 Vi = 1, . . . , fc. 



2.2 ONMF and a Weighted Variant of Spherical fc-means Let us now define a condition weaker than 



(c2) VV T = I k and V > 0. 

It can be easily checked that (|cT|) => (jc2"j) while (jc2"j) (jcTj) . The difference between conditions (|cT|) and (jc2"j) 
is that condition (lc2l does not require the rows of V to have their non-zero entries equal to each other. Now, 
if we only impose the weaker condition (|c2|) on NMF, we obtain a relaxed version of (|2.ip which, by definition, 
corresponds to orthogonal NMF: 

(2.2) min \\M ~ UV\\ 2 F such that VV T = I k . 

C/>0,V>0 

In the following, we show the equivalence of problem (|2.2p with a particular weighted variant of the spherical 
/c-means problem. 

It is well known that given a pair of solution matrices (U, V), one can find solutions with the same objective 
value ||M — by considering the pairs (UD^ 1 , DV), where D is any diagonal matrix with positive diagonal 

elements. Using this property, we can put the problem in a form where each column of matrix U has unit i^-norm. 
This simply amounts to taking D — diag(||ui||, . . ., ||u&||). Therefore, (|2.2[) is equivalent to 

(2.3) [/ min >o \\M - UV\\% s.t. {VV T ) ij = V* + 3, 

and ||Wt|| = IVi. 

Now, assuming we are given the set of non-zero entries of V (in the form of a partitioning {ni}^ =1 such 
that Vij ^ j € TTi) and the cluster directions Ui (with ||uj|| = 1 and Ui > 0), the optimal V can 
be computed in closed form. In fact, because rows of V are orthogonal to each other, we have, as before: 
||M — = Yli=i SjGTT; W m 3 ~~ u i v ij\\ 2 - F° r eacn term \\m,j — UiVij\\ 2 , the optimal v*j is given by: 



(2.4) u* = argmin \\rrij — UiX 
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rrij rrij — 2xm m + x j 

x>0 

= mjiii, 1 < i < k,j € 7Tj. 
Backsubstituting the optimal coefficients (|2.4p in (|2.3I) . we can rewrite problem (|2.3[) as 
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Since the terms mjuij are constants, we have that problem (|2.3p is finally equivalent to 
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It is insightful to compare formulation (12 .6[) of ONMF with the spherical fc-means problem [2] , which is a variant 



of fc-means where both data points and centroids are constrained to have unit norm: 

k 

r min * ll|i?TT _Ui ll 2 s -*- INI 2 = 1 > 

{^«>Ku I=1 jS7ri \\ m j\\ 

k m T 

(2.7) = max > > - — 3 —rrUi s.t. HliilU = 1. 

K-M* , r-f • \\ m i\\ 

Note that, in both problems (|2.6p and (|2.7|) . we are maximizing the cosines of the angles between u, and the data 
points from the corresponding cluster. However, we observe that: 

• Because of coefficients ||mi|| 2 , problem (|2.6[) is sensitive to the norm of the data points, as opposed to 
spherical fc-means (|2.7[) which only depends on their direction; 

• Even for normalized data points (i.e., \\mi\ \ — 1 Vi), problem (|2.6p is similar but not equivalent to spherical 
fc-means (|2.7[) because it tries to maximize the sum of squares of the cosines (instead of their sum) . 

• Contrarily to problem (|2.6[) . spherical fc-means (|2.7p does not require nonnegativity of Uj's, although it will 
clearly hold at optimality when data points rrij are nonnegative. 

To summarize, we have the following result: 

Theorem 2.1. For a nonnegativ^ data matrix M, the ONMF problem (|2.2|) is equivalent to the weighted variant 
of spherical k -means (|2.6p . 

To illustrate the differences between these different clustering techniques, Figure 1 displays a comparison 
between fc-means, standard spherical fc-means and ONMF. 

2.3 EM-like Algorithm for ONMF We present here a simple EM-like alternating algorithm designed to 
solve the ONMF problem (|2.2|) based on its equivalence with the weighted variant of spherical fc-means (|2.6[) . It 
is very similar to the standard spherical fc-means algorithm [2], except for the computation of cluster centroids. 
Specifically, it starts with an initial set of centroids, either randomly chosen or supplied as initial values. It then 
alternates between two steps: 

1. Given cluster centroids {ui}^ =1 , choose {7ri}* =1 assigning each point to its closest cluster: 

I T \ 2 

j G 7T; i = argmax (m ■ un 

l<£<k 

— argmax (mjut) . 
i<e<k 

Notice that this step is exactly equivalent to the one of standard spherical fc-means [5] • 

2. Given the clustering {7Ti}f =1 , compute the new optimal cluster centroids {ui}^ =1 as follows. Define matrix 
Mi G ]R mx l' ri l as the submatrix of M containing the columns belonging to cluster iri. We have to solve 
problem [|2.5p with respect to the Mi's: 

k k 

ff 18 ? ,» mi (mjui) =Y\\\Mfui\\l. 
{ Ui >o, K =l>J-i , r-^ v J t-^ 

There are fc independent problems: each Uj must maximize the term HM^UiH 2 ,. The optimal solution u* is 
given by the dominant left singular vector of M, associated with the largest singular value a±(Mi) of M,: 

Uj = argmax \\M i u\\ 2 — argmaxu MiM i u, 
ll«lh=i Il«ll3=i 

for which we have \ \Mfu*\\ 2 = o x (Mi) = ||Afj||2. Moreover, since Mi > 0, the Perron-Frobenius theorem 
guarantees that u* can always be chosen to be nonnegative. 

i The result also holds for any data matrix M (not necessarily nonnegative) if we remove the nonnegativity constraints on matrix 
U in the ONMF problem 112.21 1 and on vectors Ui in problem 1 12.611 . 
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Figure 1: Comparison of A:- means, standard spherical fc-means and ONMF. Diamonds are cluster centroids found 
by fc-means, continuous lines are spherical fc-means centroid directions while dashed lines are ONMF centroid 
directions. Circles and squares are data points as clustered by ONMF. As expected, fc-means is not sensitive to 
the alignment of the clusters as opposed to spherical fc-means and ONMF. On the left figure (a), the clusters 
are well separated and the three techniques perform similarly. On the right figure (b), the directional effect is 
clearly visible for both ONMF and spherical fc-means. However, there is an important difference between the two: 
ONMF is more sensitive to the data points with larger norm, while spherical fc-means treats all the points the 
same way (including the ones from the lower left cluster with smaller norm but wider angular distribution) and 
its centroids are therefore further apart from each other. 



Algorithm [T] referred to as EM-ONMF, implements this procedure. We will see in the last section that, despite 
its simplicity, it seems to work well for text clustering tasks. 

It is interesting to relate this with the original ONMF problem (|2.3|) : given a partitioning {7ri}^ =1 , let us 
denote Wi = {vij )j^ i the subvector containing only the positive entries of the i th row of V. Then, 

k 

\\M-uv\\ 2 F =j2Y,\\ m J- UiV v\\ 2 

i=l j£z7Ti 

k 

= ^2 \\ M i - UiUlf |||, 
i=l 

so that the optimal (ui,Wi) must be an optimal solution of 

(2.8) min \\Mi - Ui wf \\ 2 F . 

||Tt 4 ||=X T t**>0,TU < >Q 

Each of these problems looks for the best nonnegative rank-one approximation of a nonnegative matrix (i.e., 
a rank-one NMF problem). This in turn can be solved by combining the Eckart- Young and Perron- Frobenius 
theorems: taking the first rank-one factor generated by the singular value decomposition (SVD) (making sure it is 
nonnegative in case of non-uniqueness) leads to a minimum value for (I2.8|) equal to ||Mj|||, — a\{M{). Therefore, 
solving ONMF amounts to finding a partitioning {ni}^ =1 such that the sum of squares of the first singular values 
of submatrices Mj's is maximized, i.e., the ONMF problem (12.21) is equivalent to max^jt ^ 5Z»=i a i{^i)- 

3 Augmented Lagrangian Method for ONMF 

In this section, we present an alternative approach to solve ONMF problems. Typically, ONMF algorithms strictly 
enforce nonnegativity for each iterate while trying to achieve orthogonality at the limit. This can be done using 



Algorithm 1: EM-like Algorithm for ONMF (EM-ONMF) 



input : Nonnegative data matrix M, and initial centroids {ui}^ =1 . 

output: Clustering of the points {7rj}^ =1 , with the corresponding centroid directions {ui}\ =1 . 
while not converged do 

for j <— 1 to n do 
| find i — argmax 1<£<fc (mjue) and update cluster 7Tj = 7Ti U {j}. 
end 

if TTi — for some i then randomly transfer a point to cluster 7Tj end 
for i <— 1 to do 

| Ui <— (any) nonnegative dominant singular vector of the data submatrix Mj = M(:,7Tj). 
end 
end 



a proper penalization term [5] , a projection matrix formulation j!9j or by choosing a suitable search direction [5] . 
We propose here a method working the opposite way: at each iteration, a (continuous) projected gradient scheme 
is used to ensure that the V iterates are orthogonal (but not necessarily nonnegative). 

Nonnegativity constraints in the ONMF formulation (I2.2[) will be handled using the following augmented 
Lagrangian, defined for a matrix of Lagrange multipliers A g R^ x ™ associated to the nonnegativity constraints: 

(3.9) L P (U, V, A) = i||M - UV\\% + (A, -V) + ||| min(V, 0)|||, 

where p is the quadratic penalty parameter. Ideally, we would like to solve the Lagrangian dual 

max f (A) where f(A) = min L (U,V,A). 

Function /(A) is concave and its maximization (over a convex set) is then a convex problem, see, e.g., |17j . 
However, evaluating /(A) exactly (i.e., computing an optimal pair i7*(A) and V*(A)) is non-trivial and we 
propose here instead a simple alternating scheme to update variables U , V and A: 

1. For V and A fixed, the optimal U can be computed by solving a nonnegative least squares problem 
U 4— argmin^ gR mxfc \\M — XV\^ F . We use the efficient active-set method proposed i iH [15] . 

2. For U and A fixed, we update matrix V by means of a projected gradient scheme. Computing the projection 
of a matrix V onto the feasible set of orthogonal matrices, known as the Stiefel manifold_|, amounts to solving 
the following problem: 

Proj S4 (f) = argmin||F-X||| s.t. XX T = I k 
x 

whose optimal solution X* can be computed in closed form from the unitary factor of a polar decomposition 
of V, see, e.g., [T2J[I]. Our projected gradient scheme then reads: 

V <- Proj 5t (y - fiV v L p {U, V, A)) , 

where the step length j3 is chosen with a backtracking line search similar to that in [14] (step length is 
increased as long as there is a decrease in the objective function, and decreased otherwise). 

3. Finally, the Lagrange multipliers are updated in order to penalize the negative values of V: 

A <- max(0, A - aV) , 

2 Available at http://www.cc.gatech.edu/~hpark/ 

3 The Stiefel manifold is the set of all n X k orthogonal matrices, i.e., St(fe,n) = {X e R nxk : X T X = I k }. 



where a is a predefined sequence of step lengths decreasing to zero (e.g., a = oto/t where t is the iterations 
count and ao > is a constant parameter). This can be recognized as an (approximate) subgradient-type 
scheme [17] (in fact, one can check that V* is a subgradient of / at A when /(A) = L P (U* , V* , A)). 

To initialize the algorithm, we set A to zero and choose for the columns of V the first k right singular vectors 
of the data matrix M (which can be obtained with SVDjf). Quadratic penalty parameter p is initially fixed to 
a given small value po and then increased after each iteration. Alg. [2] implements this procedure, which we 
refer to as Orthogonal Nonnegatively Penalized Matrix Factorization (ONP-MF). We observed that the term 



Algorithm 2: Orthogonal nonnegatively penalized matrix factorization (ONP-MF) 

input : A nonnegative data matrix M, the number of clusters fc, ao > 0, po > and C > 1. 
output: The centroid matrix U, and the cluster assignment matrix V. 

Initialize A<°) = 0, the row E of with the first k right singular vectors of M, and p = pa- 
for t = 1,2,... do 

Update J/W with the optimal solution U* — argmin (7>0 ||M — UV^^^Wp. 
Update with projected gradient and a line search for step (3^: 

V® <- Proj Sf M*- 1 ) - /3«W£ P (E/ (t) , A( * _1) ) 

Update Lagrange multipliers (using an approximate subgradient): 

AW^max^A^ 1 )-^')). 



Update p <— Cp, where C > 1 is a constant parameter. 



end 



|| min(U, 0)| |f decreases linearly to zero (as augmented Lagrangian methods are expected to, see [15l Th. 17.2 
while ||M— C/U||f converges to a fixed value, see Figure [2] for an example on the Hubble dataset (cf. Section [472 
A rigorous convergence proof is a topic for further research. 
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Figure 2: Convergence of Alg.[2]for the Hubble dataset (left: constraint residual, right: approximation error). 



4 To overcome the sign ambiguity of each row of V^°\ we flip its sign if the £2-norm of its negative entries is larger than the ^2-norm 
of its positive entries. 



4 Numerical Experiments 

In this section, we report some preliminary numerical experiments showing that ONP-MF (Alg. [5]) and EM- 
ONMF (Alg. [T]) perform competitively with two recently proposed methods for ONMF: CHNMF from Choi [5] 
and PNMF from Yang and Oja [19] (Euclidean variant). It should be noted that because ONP-MF is initialized 
with SVD, all results are deterministic and obtained with just one execution of the algorithm. However, it could 
be argued that the comparison is not completely fair since the other ONMF algorithms (namely CHNMF and 
PNMF) are initialized with randomly generated factors. In order to perform a fairer comparison, we then also 
initialize CHNMF and PNMF with an SVD-based initialization [4] (SVD cannot be used directly because its 
factors are not necessarily nonnegative), which will be denoted CH(SVD) and P(SVD) respectively. Finally, 
we also report the results from two standard EM clustering algorithms, namely /c-means and spherical /c-means 
(SKM) (see, e.g., [2 ). We will see that EM-ONMF is quite efficient for text clustering tasks (see Section I4TT1 
while ONP-MF gives very good results for unsupervised image classification tasks (see Sections 14.21 and 14.31) . 

Parameters for ONP-MF are chosen as follows: ao = 100, po = 0.01 and C — 1.01 for all datasets. 
ONMF algorithms are run until a stopping condition is met, or a maximum of 5000 iterations in case of random 
initializations (for CHNMF and PNMF) and 20000 iterations for the SVD-based initialization (as done in [4]) was 
reached. The following stopping condition for CHNMF seems to work well in practical 

| ||m - U^V^Wf - \\M ~ U^V^\\ F \ 7 
\\M\\ F <W ' 

where t is the iteration count. For PNMF, we use the stopping criterion suggested by its author^: 

\\v«-v-yU\\ F 5 
\\v«-v\\ F ■ 

For ONP-MF, we check whether the current iterate is 'sufficiently' nonnegative, using 

'|nun(V,0) \\ F „ in _ 3 



< 10" 

All EM-likc algorithms, EM-ONMF included, were run until cluster assignment did not change for two consecutive 
iterations. The initial centroids were randomly selected among the data points. For each experiment, a number 
of 30 repetitions was executed in random conditions both for ONMF and EM-like algorithms. In the image 
experiments, we will display the best solution obtained (w.r.t. the error) among the 30 solutions hence obtained. 
All experiments were run on an Intel® Core™2 Duo CPU @2.40GHz with 6GB of RAM. 

4.1 Text clustering We selected nine well-known preprocessed document databases described in [20]. Each 
dataset is represented by a term- by-document matrix of varying characteristics, see Table [T] As a performance 
indicator, we use the accuracy: given a clustering {TTi}^ =1 and the true classes {Li}^ =1 of the n elements of the 
dataset, it is defined by: 



Accuracy = max — I 'S^ \%i n | J € [0,1], 



1 

Pe[i,2,...,fe] n 



where [1,2,..., k] is the set of permutations of {1, 2, . . . , k}. We report the average value of the obtained accuracy 
in Table [2j For more than half of the datasets the average best result was achieved by our algorithms, either 
EM-ONMF or ONP-MF. Moreover, our algorithms always obtain the best performance among ONMF algorithms. 
Table [3] reports the average number of iterations of each method, and Table 0] the average computational time. 
While EM-ONMF is very fast with a low number of iterations (as the other EM-like algorithms), ONP-MF is in 
general slower than the other ONMF algorithms (especially CHNMF) , and typically performs a larger number of 
iterations to converge. A similar behavior will be observed on the image decomposition tasks (see Table [5] and [5]) . 



B tt seems that 10 — 7 is a good trade-off: for example, using 10 — 8 instead leads to much larger computational times without 
significant improvements. 

7 Code available at http://users.ics.tkk.fi/rozyang/primf/index.html 



Table 1: Text mining datasets [20] . 



Data 


m 


n 
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^nonzero 


classic 


7094 


41681 


4 


223839 


ohscal 


11162 


11465 


10 


674365 


hitech 


2301 


10080 


6 


331373 


reviews 


4069 


18483 


5 


758635 


sports 


8580 


14870 


7 


1091723 


lal 


3204 


31472 


6 


484024 


la2 


3075 


31472 


6 


455383 


klb 


2340 


21839 


6 


302992 



Table 2: Average accuracy obtained by the different algorithms (in bold, best performance; underlined, second 
best). 



Dataset 


fc-means 


SKM 


CHNMF 


CH(SVD) 


PNMF 


P(SVD) 


EM-ONMF 


ONP-MF 


classic 


0.620 


0.577 


0.544 


0.559 


0.536 


0.547 


0.578 


0.538 


ohscal 


0.281 


0.418 


0.339 


0.339 


0.342 


0.338 


0.387 


0.340 


hitech 


0.318 


0.484 


0.427 


0.458 


0.414 


0.485 


0.494 


0.470 


reviews 


0.456 


0.678 


0.503 


0.494 


0.528 


0.533 


0.655 


0.510 


sports 


0.402 


0.466 


0.435 


0.430 


0.491 


0.489 


0.496 


0.500 


lal 


0.352 


0.473 


0.504 


0.444 


0.583 


0.634 


0.503 


0.658 


la2 


0.336 


0.476 


0.446 


0.422 


0.466 


0.508 


0.463 


0.528 


klb 


0.707 


0.657 


0.738 


0.606 


0.746 


0.757 


0.735 


0.790 



4.2 Hyperspectral Unmixing A hyperspectral image is a set of images of the same object or scene taken 
at different wavelengths. Each image is acquired by measuring the reflectance (i.e., the fraction of incident 
electromagnetic power reflected) of each individual pixel at a given wavelength. The aim is to classify the pixels 
in different clusters, each representing a different material. We want to cluster the columns of a wavelength-by- 
pixel reflectance matrix so that each cluster (a set of pixels) corresponds to a particular type of material. 

4.2.1 Hubble Telescope We first use a synthetic dataset from [16], see Figure[3](top row), in clean conditions 
(i.e., without noise or blur). It represents the Hubble telescope and is made up of 8 different materials, each 
having a specific spectral signature. Figure [3] displays the clustering obtained by the different algorithmt|j and 



M For EM-ONMF, fc-means and SKM we preprocess the data by discarding pixels from the background (i.e., all columns of the 
input matrix with zero ^2-norm). Recall that, for each algorithm, we keep the best solution (w.r.t. the error) among the 30 randomly 
generated initial matrices. 



Table 3: Average number of iterations for the different algorithms in the text clustering task. 



Dataset 


A;- means 


SKM 


CHNMF 


CH(SVD) 


PNMF 


P(SVD) 


EM-ONMF 


ONP-MF 


classic 


17 


26 


254 


401 


1557 


1841 


16 


2272 


ohscal 


44 


29 


354 


955 


3331 


1729 


28 


2651 


hitech 


18 


21 


297 


619 


2752 


2098 


20 


2476 


reviews 


20 


15 


121 


149 


958 


859 


13 


2581 


sports 


31 


33 


462 


882 


1777 


3452 


23 


2662 


lal 


20 


21 


272 


668 


3197 


2550 


22 


2556 


la2 


19 


20 


184 


685 


2670 


3309 


19 


2541 


klb 


13 


19 


182 


39 


1816 


1846 


12 


2434 



Table 4: Average running time in seconds for the different algorithms in the text clustering task. 



Dataset 


A:- means 


SKM 


CHNMF 


CH(SVD) 


PNMF 


P(SVD) 


EM-ONMF 


ONP-MF 


classic 


1.5 


0.3 


4 


7 


31 


38 


12 


216 


ohscal 


16.0 


1.1 


28 


73 


349 


171 


24 


565 


hitech 


2.4 


0.2 


7 


14 


83 


GO 


7 


157 


reviews 


5.8 


0.4 


5 


6 


54 


48 


10 


294 


sports 


15.1 


1.5 


34 


GG 


184 


358 


24 


478 


lal 


4.2 


0.5 


11 


27 


161 


130 


19 


359 


la2 


3.8 


0.5 


7 


2G 


130 


161 


15 


310 


klb 


1.8 


0.3 


5 


1 


66 


67 


6 


244 



Table 5: Average number of iterations for the different algorithms in image decomposition tasks. 



Dataset 


A:- means 


SKM 


CHNMF 


CH(SVD) 


PNMF 


P(SVD) 


EM-ONMF 


ONP-MF 


Hubble 


19 


15 


1025 


423 


241 


20000 


17 


2209 


Urban 


59 


66 


2209 


2645 


5000 


20000 


77 


3896 


Swimmer 


17 


20 


560 


187 


164 


20000 


15 


1531 



we observe that only ONP-MF is able to successfully recover all eight materials without any mixing. Even with 
the SVD-based initialization, CHNMF and PFNMF (i.e., CH(SVD) and P(SVD)) are not able to separate all 
materials properly; ONP-MF is the only algorithm able to perform this task (almost) perfectly. 

4.2.2 Urban Dataset The Urban hyperspectral image is taken from HYper-spectral Digital Imagery Collection 
Experiment (HYDICE) air-borne sensors. It contains 162 clean bands, and 307 x 307 pixels for each spectral 
image; it is mainly composed of 6 types of materials: road, dirt, trees, roofs, grass and metal (mostly metallic 
rooftops) as reported in [TTJ [TU] . The first row of Figure 0] displays a very good clustering obtained using N- 
FINDR5 |TS] plus manual adjustment from [11], along with the clusterings obtained with the different algorithms. 
The road and dirt are difficult to extract because their spectral signatures are similar (up to a multiplicative 
factor), and none of the algorithms is able to separate them perfectly. ONP-MF successfully extracts the grass, 
trees, and roofs and is the only algorithm able to extract the metal (second basis element), while only mixing the 
road and dirt together. Spherical fc-means, CHNMF, P(SVD) (Figured]) and EM-ONMF also perform relatively 
well, being able to extract the road (mixed with dirt or metal), trees, grass (as two separate basis elements) and 
roofs. CH(SVD) and fc-means perform relatively poorly: they are not able to separate as many materials properly. 

4.3 Image Segmentation: Swimmer Dataset The swimmer image dataset consists of 256 binary images 
of a body with 4 limbs which can be each in 4 different positions. The goal is to find a part-based decomposition 
of these images, i.e., isolate the different constitutive parts of the images (the body and the limbs, 17 in total). 
Moreover, these parts are not overlapping, and therefore no rows of V can share non-zero entries in the same 
column, and ONMF is the appropriate model. Figure [6] displays the basis elements obtained with the different 
ONMF algorithms. It can be observed that, in this case, the SVD-based initialization is of no benefit, neither for 
CHNMF nor for PNMF. All algorithms are able to successfully find the correct parts except PNMF, CH(SVD) 
and P(SVD). 



Table 6: Average running time in seconds for the different algorithms in image decomposition tasks. 



Dataset 


A;- means 


SKM 


CHNMF 


CH(SVD) 


PNMF 


P(SVD) 


EM-ONMF 


ONP-MF 


Hubble 


0.3 


0.07 


43 


6.0 


220 


445 


2.8 


106 


Urban 


42 


14 


684 


267 


1582 


2839 


493 


1174 


Swimmer 


0.1 


0.09 


5.4 


0.2 


1.7 


28 


2 


14 




Figure 3: Hubble dataset decomposition. From top to bottom: sample images at different wavelengths along with 
the true constituent materials; £;-means, spherical £;-means, CHNMF, CH(SVD), PNMF, P(SVD), EM-ONMF 
and ONP-MF. 



5 Conclusion 

In this paper, we have studied the ONMF problem and showed its equivalence with a weighted variant of spherical 
k- means. This led us to design a new EM-like algorithm for solving ONMF problems (Alg. [TJ. We have also 
proposed an alternative approach based on an augmented Lagrangian method imposing orthogonality at each 
step while relaxing the nonnegativity constraint (Alg. [5]). We finally showed on some text and image datasets 
that these new techniques compare favorably with existing ONMF algorithms. Our ONP-MF algorithm is by 
far the most robust: it always gave very good results, the best in many cases, using only one initialization. 
In particular, a single (deterministic) run of ONP-MF worked better in all image experiments than the other 
algorithms, despite the fact that they were allowed to keep the best solution obtained from 30 different (random) 
initializations. Further work includes improving its computational efficiency, on which we are currently working 
using more sophisticated optimization techniques. 
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Figure 4: Urban dataset decomposition. From top to bottom: 'true' materials, £;-means, spherical /c-means, 
CHNMF, PNMF, EM-ONMF and ONP-MF. 
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Figure 5: Urban dataset decomposition. From top to bottom: CH(SVD) and P(SVD). 
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Figure 6: Swimmer dataset decomposition. From top to bottom: /c-means, spherical /c-means, CHNMF, 
CH(SVD), PNMF, P(SVD), EM-ONMF and ONP-MF. 



